Research questions / Problem statement:

Infectious diseases are a major public health issue. We therefore examine overall communicable disease rates and trends over time for infectious diseases reported in California. Sexually transmitted diseases are analyzed separately from the other groups of infectious diseases.

Datasets to be used:

  1. Infectious Diseases by County, Year and Sex (in California), 2001-2018
     Source: https://data.chhs.ca.gov/dataset/infectious-disease
     Raw format of dataset: https://data.chhs.ca.gov/dataset/03e61434-7db8-4a53-a3e2-1d4d36d6848d/resource/75019f89-b349-4d5e-825d-8b5960fc028c/download/idb_odp_2001-2018.csv
     Name/source: CHHS Open Data
     Number of columns: 10
     Number of rows: 154,344
  2. STDs in California by disease, county, year and sex. The dataset has case counts and rates for sexually transmitted diseases (chlamydia, gonorrhea, and all forms of syphilis) reported for California residents.
     Source: https://data.chhs.ca.gov/dataset/stds-in-california-by-disease-county-year-and-sex

Creating variables for data analysis:

We created new grouping variables to facilitate data presentation and analysis. The new variables are:
Name of California region: assigns each county to one of the 10 California regions.
Type of infectious disease: groups each of the reported diseases by “type of disease”, following conventional microbiology classification.

The California regions are as follows:
Superior <- “NEVADA”, “PLACER”, “PLUMAS”, “SACRAMENTO”, “SHASTA”, “SIERRA”, “SISKIYOU”, “SUTTER”, “TEHAMA”, “YOLO”, “YUBA”, “MODOC”, “EL DORADO”, “BUTTE”, “GLENN”, “LASSEN”
North Coast <- “DEL NORTE”, “HUMBOLDT”, “LAKE”, “MENDOCINO”, “NAPA”, “SONOMA”, “TRINITY”
Bay area <- “ALAMEDA”, “CONTRA COSTA”, “MARIN”, “SAN FRANCISCO”, “SAN MATEO”, “SANTA CLARA”, “SOLANO”
North San Joaquin Valley <- “ALPINE”, “AMADOR”, “CALAVERAS”, “MADERA”, “MARIPOSA”, “MERCED”, “MONO”, “SAN JOAQUIN”, “STANISLAUS”, “TUOLUMNE”
Central Coast <- “MONTEREY”, “SAN BENITO”, “SAN LUIS OBISPO”, “SANTA BARBARA”, “SANTA CRUZ”, “VENTURA”
South San Joaquin Valley <- “FRESNO”, “INYO”, “KERN”, “KINGS”, “TULARE”
Inland Empire <- “RIVERSIDE”, “SAN BERNARDINO”
LA County <- “LOS ANGELES”
Orange County <- “ORANGE”
San Diego and Imperial County <- “IMPERIAL”, “SAN DIEGO”
We will also have California as a total.
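The region assignment listed above can be expressed as a lookup table. The following is a minimal base-R sketch, not the project's actual code: the names `regions`, `county_to_region`, and `add_region` are hypothetical, the toy data frame is made up, and only three of the ten regions are spelled out to keep it short.

```r
# Hypothetical sketch: map county names to California regions with a lookup table.
# Only three of the ten regions are listed; the rest follow the same pattern.
regions <- list(
  "Inland Empire" = c("RIVERSIDE", "SAN BERNARDINO"),
  "LA County" = c("LOS ANGELES"),
  "San Diego and Imperial County" = c("IMPERIAL", "SAN DIEGO")
)

# stack() turns the named list into a two-column data frame (values, ind).
county_to_region <- stack(regions)
names(county_to_region) <- c("county", "region")
county_to_region$region <- as.character(county_to_region$region)

# Attach a region to each row; counties missing from the lookup get NA.
add_region <- function(df) {
  idx <- match(toupper(df$county), county_to_region$county)
  df$region <- county_to_region$region[idx]
  df
}

# Made-up rows for illustration only.
example <- data.frame(county = c("Riverside", "LOS ANGELES", "ALPINE"),
                      cases = c(10, 25, 1))
result <- add_region(example)
print(result)
```

Upper-casing the county before matching means the same lookup works for both source files, which spell county names in different cases. Rows that come back with `NA` point to counties that still need a region.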

The groups of infectious diseases will be as follows:
1. Parasitic <- c(“Amebiasis”, “Babesiosis”, “Cryptosporidiosis”, “Cyclosporiasis”, “Cysticercosis or Taeniasis”, “Malaria”, “Giardiasis”, “Trichinosis”)
2. Toxin_related <- c(“Botulism, Foodborne”, “Botulism, Other”, “Botulism, Wound”, “Ciguatera Fish Poisoning”, “Domoic Acid Poisoning”, “Paralytic Shellfish Poisoning”, “Scombroid Fish Poisoning”)
3. viral <- c(“Chikungunya Virus Infection”, “Dengue Virus Infection”, “Flavivirus Infection of Undetermined Species”, “Hantavirus Infection”, “Hepatitis E acute infection”, “Rabies, human”, “Yellow Fever”, “Zika Virus Infection”)
4. prions <- c(“Creutzfeldt-Jakob Disease and other Transmissible Spongiform Encephalopathies”)
5. fungal <- c(“Coccidioidomycosis”)
6. Bacterial <- c(“Anaplasmosis”, “Anaplasmosis and Ehrlichiosis”, “Anthrax”, “Brucellosis”, “Campylobacteriosis”, “Cholera”, “E. coli O157”, “E. coli Other STEC (non-O157)”, “Legionellosis”, “Leprosy (Hansen’s Disease)”, “Leptospirosis”, “Listeriosis”, “Lyme Disease”, “Plague, human”, “Q Fever”, “Spotted Fever Rickettsiosis”, “Streptococcal Infection (cases in food and dairy workers)”, “Ehrlichiosis”, “Psittacosis”, “Salmonellosis”, “Shigellosis”, “Tularemia”, “Typhoid Fever”, “Paratyphoid Fever”, “Typhus Fever”, “Relapsing Fever”, “Shiga toxin-producing E. coli (STEC) without Hemolytic Uremic Syndrome (HUS)”, “Vibrio Infection (non-Cholera)”, “Shiga Toxin Positive Feces (without culture confirmation)”, “Yersiniosis”)
7. Infectious_complications <- c(“Hemolytic Uremic Syndrome (HUS) without evidence of Shiga toxin-producing E. coli (STEC)”, “Hemolytic Uremic Syndrome (HUS)”, “Shiga toxin-producing E. coli (STEC) with Hemolytic Uremic Syndrome (HUS)”)
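The disease grouping above amounts to a named lookup from disease to type. A minimal base-R sketch under stated assumptions: `groups`, `disease_type`, and `classify_disease` are hypothetical names, only a few diseases per group are listed, and "Measles" is a made-up input used to show what happens to a disease that is missing from the lookup.

```r
# Hypothetical sketch: classify diseases by type using a named lookup vector.
# Only a few diseases per group are listed; the full lists are in the text above.
groups <- list(
  Parasitic = c("Amebiasis", "Giardiasis", "Malaria"),
  Fungal = c("Coccidioidomycosis"),
  Bacterial = c("Salmonellosis", "Shigellosis", "Listeriosis")
)

# Invert the list: names are diseases, values are group labels.
disease_type <- setNames(rep(names(groups), lengths(groups)), unlist(groups))

classify_disease <- function(disease) {
  out <- unname(disease_type[disease])
  out[is.na(out)] <- "Unclassified"  # flag anything not in the lookup
  out
}

types <- classify_disease(c("Malaria", "Salmonellosis", "Measles"))
print(types)
```

Labelling unmatched diseases explicitly, rather than leaving them as `NA`, makes it easier to spot spelling differences between the dataset and the group lists.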

# Statewide case totals by disease type and year (bacterial, parasitic, fungal, viral)
ID_table3californiaonlyB <- IDtable2 %>%
  filter(sex=="TOTAL", county=="CALIFORNIA",
         ID_type %in% c("Bacterial", "Parasitic", "Fungal", "Viral")) %>%
  group_by(ID_type, year) %>%
  # the statewide population should be identical on every row within a year,
  # so take it once per group to get a single rate per type and year
  summarize(sum_cases=sum(cases),
            rate=sum_cases/first(population),
            .groups="drop")
 # cbind(populationv)%>%
 # rename(pop_total="...4")%>%
 #mutate(rate_total= (sum_cases/population)*100000) 
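The rate in this chunk is summed cases divided by population, and the tables in this report express rates per 100,000. A minimal base-R sketch of that calculation with made-up numbers; the helper `rate_per_100k` and the `toy` data frame are hypothetical and are not part of the project code.

```r
# Hypothetical sketch with made-up numbers: incidence rate per 100,000 residents.
rate_per_100k <- function(cases, population) {
  (cases / population) * 100000
}

toy <- data.frame(
  ID_type = c("Bacterial", "Bacterial", "Fungal"),
  year = c(2001, 2001, 2001),
  cases = c(300, 200, 50),
  population = c(1000000, 1000000, 1000000)
)

# Sum cases within each type and year, then divide by that year's population once.
totals <- aggregate(cases ~ ID_type + year + population, data = toy, FUN = sum)
totals$rate <- rate_per_100k(totals$cases, totals$population)
print(totals)
```

Keeping the population in the grouping keys ensures it is used once per group instead of once per disease row.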

Tables

my_table_data <- ID_tableyears_group_total %>%
  select(c("ID_type","region","rate","time_period")) %>%
  filter(ID_type=="Bacterial"|ID_type== "Parasitic"|ID_type=="Fungal"|ID_type=="Viral") %>%
  filter(region=="California")%>%
  drop_na(rate) %>%
  group_by(ID_type,time_period,region) %>%
  summarise(cumm_rate = sum (rate))
## `summarise()` regrouping output by 'ID_type', 'time_period' (override with `.groups` argument)
my_new_table_data <- my_table_data %>%
  pivot_wider(names_from=c(ID_type),values_from= "cumm_rate")
kable(my_new_table_data, 
      booktabs=T, 
      col.names=c("Time Period", " ","Bacterial", "Fungal", "Parasitic", "Viral"),  
      align='lccc', 
      caption="Infectious disease rates (Cases/100,000) over time by disease etiology (from 2001 - 2018 by 3 year increments)",
      format.args=list(big.mark=","))
Infectious disease rates (Cases/100,000) over time by disease etiology (from 2001 - 2018 by 3 year increments)
Time Period  Region      Bacterial  Fungal  Parasitic  Viral
2001-2003    California     109.46   14.69      30.03   0.07
2004-2006    California     101.86   23.37      26.38     NA
2007-2009    California     102.86   21.00      24.11   0.10
2010-2012    California     108.82   36.58      20.97   0.55
2013-2015    California     124.43   22.77      21.15   1.04
2016-2018    California     142.98   52.33      27.80   1.76
##Mai
my_bayarea_table <- ID_tableyears_group_total %>%
  select(c("ID_type","region","rate","time_period")) %>%
  filter(ID_type=="Bacterial"|ID_type== "Parasitic"|ID_type=="Fungal"|ID_type=="Viral") %>%
  filter(region=="Bay_area") %>%
  drop_na(rate) %>%
 #group_by(ID_type, time_period)%>%
  group_by(ID_type,time_period,region) %>%
  summarise(cumm_rate = sum (rate))
## `summarise()` regrouping output by 'ID_type', 'time_period' (override with `.groups` argument)
my_new_bayarea_table <- my_bayarea_table %>%
  pivot_wider(names_from=c(ID_type),values_from= "cumm_rate")
# Table for  bay area only :
kable(my_new_bayarea_table, 
      booktabs=T, 
      col.names=c("Time_Period", " ", "Bacterial", "Fungal", "Parasitic", "Viral"),
      align='lccc', 
      caption="Infectious disease rates over time in the Bay Area from 2001-2018 by etiology of disease and time period (3-year cumulative rates)",
      format.args=list(big.mark=","))
Infectious disease rates over time in the Bay Area from 2001-2018 by etiology of disease and time period (3-year cumulative rates)
Time_Period  Region    Bacterial  Fungal  Parasitic  Viral
2001-2003    Bay_area     945.44      NA     342.40     NA
2004-2006    Bay_area     947.73   10.53     306.70     NA
2007-2009    Bay_area     932.71    9.76     254.58     NA
2010-2012    Bay_area     992.36   14.77     250.65     NA
2013-2015    Bay_area   1,136.90   20.89     228.59     NA
2016-2018    Bay_area   1,300.47   42.55     343.04   3.47
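Both tables group years into 3-year periods through the `time_period` variable, whose construction is not shown in this report. One way such a variable could be built, offered as an assumption rather than the authors' code, is with base R's `cut()`; the helper name `to_time_period` is hypothetical.

```r
# Hypothetical sketch: assign each year from 2001 to 2018 to a 3-year period label.
to_time_period <- function(year) {
  starts <- seq(2001, 2016, by = 3)
  labels <- paste(starts, starts + 2, sep = "-")
  # Breaks at 2000, 2003, ..., 2018; intervals are closed on the right,
  # so (2000, 2003] holds the years 2001, 2002 and 2003.
  as.character(cut(year, breaks = c(starts - 1, 2018), labels = labels))
}

periods <- to_time_period(c(2001, 2003, 2004, 2018))
print(periods)
```

Years outside 2001-2018 fall outside every interval and come back as `NA`, which makes out-of-range rows easy to detect.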

Figures

#Figures:

# Lourdes. Create a table showing trends of group of diseases by time_period  in California. From all groups of diseases
# Creating table for graph:
ID_tableyears_group_total_fig1L <- ID_tableyears_group_total %>%
  filter(ID_type %in% c("Bacterial", "Parasitic", "Fungal", "Viral")) %>%
  select(c("ID_type","region","rate","time_period")) %>%
  filter(region=="California") %>%
  drop_na(rate) %>%
  group_by(ID_type,time_period,region) %>%
  # cumm_rate is the y variable of the plot below, so this step has to run
  summarise(cumm_rate = sum(rate))

ggplot(ID_tableyears_group_total_fig1L, aes(x = ID_type, y = cumm_rate)) +
  geom_bar(aes(fill=time_period), stat="identity", position = position_dodge()) +
  # format() takes big.mark (with a dot) for the thousands separator
  scale_y_continuous(labels = function(x) format(x, big.mark=",", scientific=FALSE)) +
  scale_fill_discrete(name= "time_period") +
  # alternative manual palette:
  # scale_fill_manual(name= "time_period",
  #   values=c("#ffd333","#ff6600","#be0f24","#f91cc7","#910ff5","#003884")) +
  labs(x="Group of Infectious disease", y = "Cumulative rate",
       title = "Figure 1: Trend of infectious diseases over time from 2001-2018 by type of 
     Disease and time period (3-year cumulative rates)")

Figure 1 shows that, among the reported infectious diseases (excluding sexually transmitted diseases), bacterial diseases are the most commonly reported, followed by fungal and then parasitic diseases. Viral diseases have the lowest rate. These numbers do not necessarily reflect true prevalence, since many diseases are not considered “reportable” because they are so common and widely distributed.

#Figure 2L Trends from 2001-2018 only for Bacterial, Fungal and Parasitic conditions
ID_tableyears_group_total_fig2L <- ID_tableyears_group_total %>%
  select(c("ID_type","region","rate","year")) %>%
  mutate(year=as.character(year)) %>%
         filter(region=="California")%>%
  filter(ID_type=="Bacterial"|ID_type== "Parasitic"|ID_type=="Fungal"|ID_type=="Viral") %>%
    drop_na(rate) %>%
  group_by(ID_type,year,region) %>%
  summarise(cumm_rate = sum (rate)) 
## `summarise()` regrouping output by 'ID_type', 'year' (override with `.groups` argument)
#Figure 2 corrected
Figure2L_table <- ID_table3californiaonly


#using plotly
plot_ly(
  ID_tableyears_group_total_fig2L,
  x= ~year,
  y= ~cumm_rate,
  color= ~ID_type,
  type="bar"
) %>%
  layout(barmode="stack")%>%
  
  layout(
    title = "Figure 2: Trends of Bacterial, Fungal and Parasitic Disease
    rates per 100,000 from 2001-2018 in California
    (Excludes STDs)",
    xaxis = list(title = "Years"),
    yaxis = list(title = "California Rates per 100,000")
  )

Figure 2: Over the years, reports of bacterial diseases have increased. This increase could reflect either a real increase in reportable cases or improved reporting methodology. The same applies to fungal infections. Parasitic infections have ranged from 4.6/100,000 in 2002 to 19.33/100,000 in 2017.

#Lourdes most common bacterial diseases from 2001 to 2018:
ID_mostcommon_bacterial_fig3L <- IDtable2 %>%
  #select(c("disease","ID_type","region","rate","year","sex","cases","population")) %>%
  mutate(year=as.character(year)) %>%
         filter(region=="California")%>%
  filter(sex=="TOTAL") %>%
  filter(ID_type=="Bacterial") %>%
      drop_na(rate) %>%
 group_by(disease) %>%
  summarise(cumm_rate = sum (rate)) %>%
  #summarize(cum_disease = sum(disease)) %>%
#mutate(average_cases= (cum_disease/18))
filter(cumm_rate >= 5)
## `summarise()` ungrouping output (override with `.groups` argument)
#using plotly
plot_ly(
  ID_mostcommon_bacterial_fig3L,
  x= ~disease,
  y= ~cumm_rate,
  color= ~disease,
  type="bar") %>%
# layout(barmode="stack")%>%
  
  layout(
    title = "Most commonly reported Bacterial Infections
     approx. cumulative rates per 100,000 from 2001-2018 in California
    (Excludes STDs)",
    xaxis = list(title = "Bacterial Infections"),
    yaxis = list(title = "California Rates per 100,000")
  )
#Lourdes; will create a new bar chart for the most commonly reported parasitic disease rates in California
ID_mostcommon_parasitic_fig4L <- IDtable2 %>%
  select(c("disease","ID_type","region","rate","year" ))%>%
  mutate(year=as.character(year)) %>%
         filter(region=="California")%>%
  filter(ID_type=="Parasitic") %>%
  #filter(year=="2018")
    drop_na(rate) %>%
 # group_by(disease,year) %>%
  group_by(disease) %>%
  summarise(cumm_rate = sum (rate)) %>%
  filter(cumm_rate >= 1)
## `summarise()` ungrouping output (override with `.groups` argument)
#using plotly
plot_ly(
  ID_mostcommon_parasitic_fig4L,
  x= ~disease,
  y= ~cumm_rate,
  color= ~disease,
  type="bar") %>%
# layout(barmode="stack")%>%
  
  layout(
    title = "Most common reported Parasitic Diseases
    rates per 100,000 from 2001-2018 in California",
    xaxis = list(title = "Parasitic Infections"),
    yaxis = list(title = "California Rates per 100,000")
  )
#Lourdes; will create a new bar chart for the most commonly reported fungal disease rates in California
ID_mostcommon_fungal_fig5L <- IDtable2 %>%
  select(c("disease","ID_type","region","rate","year")) %>%
  mutate(year=as.character(year)) %>%
  filter(region=="California") %>%
  filter(ID_type=="Fungal") %>%
  #filter(year=="2018")
  drop_na(rate) %>%
  # group_by(disease,year) %>%
  group_by(disease) %>%
  # cumm_rate is used by the filter and the plot below, so the summary step must run
  summarise(cumm_rate = sum(rate)) %>%
  filter(cumm_rate >= 0.0001)
#using plotly
plot_ly(
  ID_mostcommon_fungal_fig5L,
  x= ~disease,
  y= ~cumm_rate,
  color= ~disease,
  type="bar")%>%
# layout(barmode="stack")%>%
  
  layout(
    title = "Most common reported Fungal Infections
    rates per 100,000 from 2001-2018 in California
    (Excludes STD's)",
    xaxis = list(title = "Fungal infections"),
    yaxis = list(title = "California Rates per 100,000")
  )
#Lourdes; will create a new bar chart for the most commonly reported viral disease rates in California
ID_mostcommon_viral_fig6L <- IDtable2 %>%
  select(c("disease","ID_type","region","rate","year")) %>%
  mutate(year=as.character(year)) %>%
         filter(region=="California")%>%
  filter(ID_type=="Viral") %>%
  #filter(year=="2018")
    drop_na(rate) %>%
 # group_by(disease,year) %>%
  group_by(disease) %>%
  summarise(cumm_rate = sum (rate)) %>%
  filter(cumm_rate > 0.05)
## `summarise()` ungrouping output (override with `.groups` argument)
#using plotly
plot_ly(
  ID_mostcommon_viral_fig6L,
  x= ~disease,
  y= ~cumm_rate,
  color= ~disease,
  type="bar")%>%
# layout(barmode="stack")%>%
  
  layout(
    title = "Most common reported Viral Infections
    rates per 100,000 from 2001-2018 in California
    (Excludes STD's)",
    xaxis = list(title = "Viral infections"),
    yaxis = list(title = "California Rates per 100,000")
  )
#sandya-create disease trend over time

individualdata1<- read_csv("stds-by-disease-county-year-sex.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Disease = col_character(),
##   County = col_character(),
##   Year = col_double(),
##   Sex = col_character(),
##   Cases = col_double(),
##   Population = col_double(),
##   Rate = col_double(),
##   `Lower 95% CI` = col_double(),
##   `Upper 95% CI` = col_double(),
##   `Annotation Code` = col_character()
## )
groupdata<-read_csv("idb_odp_2001-2018 (1) (1).csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Disease = col_character(),
##   County = col_character(),
##   Year = col_double(),
##   Sex = col_character(),
##   Cases = col_double(),
##   Population = col_double(),
##   `Lower 95% CI` = col_double(),
##   `Upper 95% CI` = col_double(),
##   Rate = col_character()
## )
groupdatafinal<-filter(groupdata, County %in% c("ALAMEDA", "SANTA CLARA", "SAN MATEO", "SAN FRANCISCO", "MARIN", "CONTRA COSTA", "SOLANO"))

  
individualdatafinal<-filter(individualdata1, County %in% c("Alameda", "Santa Clara", "San Mateo", "San Francisco", "Marin", "Contra Costa", "Solano"))%>% select(-10)


combineddata<-rbind(groupdatafinal, individualdatafinal)

combineddata$County<-tolower(combineddata$County)
combineddata$Sex<-tolower(combineddata$Sex)

combineddata1<-combineddata%>%filter(Sex %in% "total")

# Assign each disease to a broad type; diseases not listed in any branch get NA
combinedatafinal <- combineddata1 %>%
  mutate(Disease_Type = case_when(
    Disease %in% c("Gonorrhea", "Early Syphilis", "Chlamydia") ~ "STD (Bacterial)",
    Disease %in% c(
      "Shiga toxin-producing E. coli (STEC) with Hemolytic Uremic Syndrome (HUS)",
      "Shiga toxin-producing E. coli (STEC) without Hemolytic Uremic Syndrome (HUS)",
      "Anaplasmosis and Ehrlichiosis",
      "Hemolytic Uremic Syndrome (HUS) without evidence of Shiga toxin-producing E. coli (STEC)",
      "Paratyphoid Fever", "Ehrlichiosis", "Anaplasmosis",
      "Shiga Toxin Positive Feces (without culture confirmation)",
      "E. coli Other STEC (non-O157)", "Yersiniosis", "Vibrio Infection (non-Cholera)",
      "Typhoid Fever", "Typhus Fever", "Tularemia",
      "Streptococcal Infection (cases in food and dairy workers)",
      "Spotted Fever Rickettsiosis", "Shigellosis", "Salmonellosis", "Relapsing Fever",
      "Q Fever", "Psittacosis", "Plague, human", "Lyme Disease", "Listeriosis",
      "Leptospirosis", "Leprosy (Hansen's Disease)", "Legionellosis", "E. coli O157",
      "Cholera", "Campylobacteriosis", "Brucellosis", "Anthrax"
    ) ~ "Bacteria",
    Disease %in% c(
      "Chikungunya Virus Infection", "Dengue Virus Infection",
      "Flavivirus Infection of Undetermined Species", "Hantavirus Infection",
      "Hepatitis E, acute infection", "acute infection", "Rabies, human",
      "Yellow Fever", "Zika Virus Infection"
    ) ~ "Virus",
    Disease %in% c(
      "Amebiasis", "Babesiosis", "Cryptosporidiosis", "Cyclosporiasis",
      "Cysticercosis or Taeniasis", "Malaria", "Giardiasis", "Trichinosis"
    ) ~ "Protozoa",
    Disease %in% c(
      "Botulism, Foodborne", "Botulism, Other", "Botulism, Wound",
      "Ciguatera Fish Poisoning", "Domoic Acid Poisoning",
      "Paralytic Shellfish Poisoning", "Scombroid Fish Poisoning"
    ) ~ "Toxin",
    Disease %in% c(
      "Creutzfeldt-Jakob Disease and other Transmissible Spongiform Encephalopathies"
    ) ~ "Prion",
    Disease == "Hemolytic Uremic Syndrome (HUS)" ~ "Infectious Complication",
    Disease == "Coccidioidomycosis" ~ "Fungal"
  ))
data1<-combinedatafinal%>%select(-9)

data1$Rate<-(data1$Cases/data1$Population)*100000

plotdatatest<-data1%>%group_by(Disease_Type, Year)%>%summarize(Sum_cases=sum(Cases))
## `summarise()` regrouping output by 'Disease_Type' (override with `.groups` argument)
testin<-data1%>%group_by(County, Year)%>%summarize(total_pop=mean(Population))%>%group_by(Year)%>%summarise(totalp=sum(total_pop))
## `summarise()` regrouping output by 'County' (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
#plotdata1$Real_Rate<-

hope<-left_join(testin, plotdatatest, by= "Year")
hope$Overall_Rate<-(hope$Sum_cases/hope$totalp)*100000
#plotdata<-data1%>%group_by(Disease_Type, #Year)%>%summarize(Mean_Rate=mean(Rate))
# The y axis shows the pooled Overall_Rate, not a mean of county rates
ggplot(hope, aes(x=Year, y=Overall_Rate)) +
  facet_wrap(vars(Disease_Type), ncol = 2) +
  geom_line(aes(color = Disease_Type)) +
  labs(x="Year", y="Overall Rate", title = "Overall Rates of Infectious diseases (including STDs) in Bay Area Counties from 2001-2018") +
  theme_minimal()

Interpretation of graph: This graph shows the overall rates per year of different types of infectious disease in the Bay Area counties from 2001 to 2018. STD rates are increasing much more steeply than the rates of the other types of infectious diseases.
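The overall rate plotted here is pooled: cases are summed across the Bay Area counties and divided by the summed county populations, rather than averaging each county's own rate. A minimal base-R sketch with made-up numbers shows how the two can differ; the `toy` data frame and its county names are hypothetical.

```r
# Hypothetical sketch with made-up numbers: pooled rate vs. mean of county rates.
toy <- data.frame(
  County = c("small", "large"),
  Cases = c(10, 100),
  Population = c(10000, 1000000)
)

# Per-county rates per 100,000 residents.
county_rate <- toy$Cases / toy$Population * 100000

# Unweighted mean treats both counties equally, regardless of size.
mean_rate <- mean(county_rate)

# Pooled rate sums cases and population first, so large counties weigh more.
overall_rate <- sum(toy$Cases) / sum(toy$Population) * 100000

print(c(mean_rate = mean_rate, overall_rate = overall_rate))
```

When county populations differ a lot, the unweighted mean can be dominated by a small county with a high rate, which is one reason to prefer the pooled rate for a regional summary.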

plotdata2<-hope%>% filter(Disease_Type != "STD (Bacterial)")


ggplot(plotdata2, aes(x=Year, y=Overall_Rate))+geom_line(aes(color = Disease_Type))+facet_wrap(vars(Disease_Type), ncol = 2)+labs(x="Year", y="Overall Rate", title = "Overall rates of Infectious Diseases in Bay Area Counties from 2001-2018") + theme_minimal()

Interpretation of graph: This graph shows the overall rates per year of different types of infectious disease from 2001 to 2018 in the Bay Area counties (excluding STDs). I created this graph to better visualize the trends in diseases other than STDs. Fungal rates are increasing over time, whereas toxins, prions, and infectious complications remain at a very low, steady level. Bacterial infection rates remain higher than those of the other disease types but appear to be steady over time.

plotdata3<-hope%>% filter(Disease_Type == "STD (Bacterial)")

ggplot(plotdata3, aes(x=Year, y=Overall_Rate))+geom_line(stat="identity", color= "#910ff5")+facet_wrap(vars(Disease_Type))+labs(x="Year", y="Overall Rate", title = "Overall rates per year of STDs (Bacterial) in Bay Area Counties from 2001-2018") +theme_minimal()

Interpretation of graph: This graph shows the overall rates per year of bacterial STDs in the Bay Area counties from 2001 to 2018. I created this graph to better visualize the trends in STDs. The graph shows a marked increase in the overall rate of STDs over the period, with an even steeper increase from 2013 onward.
